Hybrid Speech Recognition

ABSTRACT

A hybrid speech recognition system uses a client-side speech recognition engine and a server-side speech recognition engine to produce speech recognition results for the same speech. An arbitration engine produces speech recognition output based on one or both of the client-side and server-side speech recognition results.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/890,280, filed on Sep. 24, 2010, entitled, “Hybrid Speech Recognition”; which is a continuation of U.S. patent application Ser. No. 12/550,380, filed on Aug. 30, 2009, entitled, “Hybrid Speech Recognition” (now U.S. Pat. No. 7,933,777, issued on Apr. 26, 2011); which claims the benefit of U.S. Prov. Pat. App. Ser. No. 61/093,220, filed on Aug. 29, 2008, entitled, “Hybrid Speech Recognition”; all of which are hereby incorporated by reference herein.

BACKGROUND

A variety of automatic speech recognizers (ASRs) exist for performing functions such as converting speech into text and controlling the operations of a computer in response to speech. Some applications of automatic speech recognizers require shorter turnaround times (the amount of time between when the speech is spoken and when the speech recognizer produces output) than others in order to appear responsive to the end user. For example, a speech recognizer that is used for a “live” speech recognition application, such as controlling the movement of an on-screen cursor, may require a shorter turnaround time (also referred to as a “response time”) than a speech recognizer that is used to produce a transcript of a medical report.

The desired turnaround time may depend, for example, on the content of the speech utterance that is processed by the speech recognizer. For example, for a short command-and-control utterance, such as “close window,” a turnaround time above 500 ms may appear sluggish to the end user. In contrast, for a long dictated sentence which the user desires to transcribe into text, response times of 1000 ms may be acceptable to the end user. In fact, in the latter case users may prefer longer response times because they may otherwise feel that their speech is being interrupted by the immediate display of text in response to their speech. For longer dictated passages, such as entire paragraphs, even longer response times of multiple seconds may be acceptable to the end user.

In typical prior art speech recognition systems, improving response time while maintaining recognition accuracy requires increasing the computing resources (processing cycles and/or memory) that are dedicated to performing speech recognition. Similarly, in typical prior art speech recognition systems, recognition accuracy may typically be increased without sacrificing response time only by increasing the computing resources that are dedicated to performing speech recognition. One example of a consequence of these tradeoffs is that when porting a given speech recognizer from a desktop computer platform to an embedded system, such as a cellular telephone, with fewer computing resources, recognition accuracy must typically be sacrificed if the same response time is to be maintained.

One known technique for overcoming these resource constraints in the context of embedded devices is to delegate some or all of the speech recognition processing responsibility to a speech recognition server that is located remotely from the embedded device and which has significantly greater computing resources than the embedded device. When a user speaks into the embedded device in this situation, the embedded device does not attempt to recognize the speech using its own computing resources. Instead, the embedded device transmits the speech (or a processed form of it) over a network connection to the speech recognition server, which recognizes the speech using its greater computing resources and therefore produces recognition results more quickly than the embedded device could have produced with the same accuracy. The speech recognition server then transmits the results back over the network connection to the embedded device. Ideally this technique produces highly-accurate speech recognition results more quickly than would otherwise be possible using the embedded device alone.

In practice, however, this server-side speech recognition technique has a variety of shortcomings. In particular, because server-side speech recognition relies on the availability of high-speed and reliable network connections, the technique breaks down if such connections are not available when needed. For example, the potential increases in speed made possible by server-side speech recognition may be negated by use of a network connection without sufficiently high bandwidth. As one example, the typical network latency of an HTTP call to a remote server can range from 100 ms to 500 ms. If spoken data arrives at a speech recognition server 500 ms after it is spoken, it will be impossible for that server to produce results quickly enough to satisfy the minimum turnaround time (500 ms) required by command-and-control applications. As a result, even the fastest speech recognition server will produce results that appear sluggish if used in combination with a slow network connection.

What is needed, therefore, are improved techniques for producing high-quality speech recognition results for embedded devices within the turnaround times required by those devices, but without requiring low-latency, high-availability network connections.

SUMMARY

A hybrid speech recognition system uses a client-side speech recognition engine and a server-side speech recognition engine to produce speech recognition results for the same speech. An arbitration engine produces speech recognition output based on one or both of the client-side and server-side speech recognition results.

Other features and advantages of various aspects and embodiments of the present invention will become apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a dataflow diagram of a speech recognition system according to one embodiment of the present invention;

FIG. 2 is a flowchart of a method performed by the system of FIG. 1 according to one embodiment of the present invention;

FIGS. 3A-3E are flowcharts of methods performed by an arbitration engine to produce hybrid speech recognition output according to various embodiments of the present invention; and

FIGS. 4A-4F are flowcharts of methods performed by a speech recognition system to process overlapping recognition results from multiple speech recognition engines according to various embodiments of the present invention.

DETAILED DESCRIPTION

Referring to FIG. 1, a dataflow diagram is shown of a speech recognition system 100 according to one embodiment of the present invention. Referring to FIG. 2, a flowchart is shown of a method 200 performed by the system 100 of FIG. 1 according to one embodiment of the present invention.

A user 102 of a client device 106 speaks and thereby provides speech 104 to the client device (step 202). The client device 106 may be any device, such as a desktop or laptop computer, cellular telephone, personal digital assistant (PDA), or telephone. Embodiments of the present invention, however, are particularly useful in conjunction with resource-constrained clients, such as computers or mobile computing devices with slow processors or small amounts of memory, or computers running resource-intensive software. The device 106 may receive the speech 104 from the user 102 in any way, such as through a microphone connected to a sound card. The speech 104 may be embodied in an audio signal which is tangibly stored in a computer-readable medium and/or transmitted over a network connection or other channel.

The client device 106 includes an application 108, such as a transcription application or other application which needs to recognize the speech 104. The application 108 transmits the speech 104 to a delegation engine 110 (step 204). Alternatively, the application 108 may process the speech 104 in some way and provide the processed version of the speech 104, or other data derived from the speech 104, to the delegation engine 110. The delegation engine 110 itself may process the speech 104 (in addition to or instead of any processing performed on the speech by the application) in preparation for transmitting the speech for recognition.

The delegation engine 110 may present the same interface to the application 108 as that presented by a conventional automatic speech recognition engine. As a result, the application 108 may provide the speech 104 to the delegation engine 110 in the same way that it would provide the speech 104 directly to a conventional speech recognition engine. The creator of the application 108, therefore, need not know that the delegation engine 110 is not itself a conventional speech recognition engine. As will be described in more detail below, the delegation engine 110 also provides speech recognition results back to the application 108 in the same manner as a conventional speech recognition engine. Therefore, the delegation engine 110 appears to perform the same function as a conventional speech recognition engine from the perspective of the application 108.

The delegation engine 110 provides the speech 104 (or a processed form of the speech 104 or other data derived from the speech 104) to both a client-side automatic speech recognition engine 112 in the client device 106 (step 206) and to a server-side automatic speech recognition engine 120 in a server 118 located remotely over a network 116 (step 208). The server 118 may be a computing device which has significantly greater computing resources than the client device.
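
The following is a minimal sketch, not taken from the patent, of how steps 206 and 208 might be implemented: the delegation engine exposes a recognize() call like a conventional engine and fans the same audio out to both recognizers concurrently. The class, method, and attribute names are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of the fan-out in steps 206
# and 208: the same audio is submitted to both recognizers in parallel.
from concurrent.futures import Future, ThreadPoolExecutor


class DelegationEngine:
    def __init__(self, client_recognizer, server_recognizer):
        self.client = client_recognizer   # runs locally on the device
        self.server = server_recognizer   # wraps the network call to 118
        self.pool = ThreadPoolExecutor(max_workers=2)

    def recognize(self, audio: bytes) -> tuple[Future, Future]:
        # Neither engine blocks the other; the arbitration engine later
        # decides which result (or combination) to return (step 214).
        client_future = self.pool.submit(self.client.recognize, audio)
        server_future = self.pool.submit(self.server.recognize, audio)
        return client_future, server_future
```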

The client-side speech recognizer 112 and server-side speech recognizer 120 may be conventional speech recognizers. The client-side speech recognizer 112 and server-side speech recognizer 120 may, however, differ from each other. For example, the server-side speech recognizer 120 may use more complex speech recognition models which require more computing resources than those used by the client-side speech recognizer 112. As another example, one of the speech recognizers 112 and 120 may be speaker-independent, while the other may be adapted to the voice of the user 102. The client-side recognizer 112 and server-side recognizer 120 may have different response times due to a combination of differences in the computing resources of the client 106 and server 118, differences in the speech recognizers 112 and 120 themselves, and the fact that the results from the server-side recognizer 120 must be provided back to the client device 106 over the network 116, thereby introducing latency not incurred by the client-side recognizer 112.

Responsibilities may be divided between the client-side speech recognizer 112 and server-side speech recognizer 120 in various ways, whether or not such recognizers 112 and 120 differ from each other. For example, the client-side speech recognizer 112 may be used solely for command-and-control speech recognition, while the server-side speech recognizer 120 may be used for both command-and-control and dictation recognition. As another example, the client-side recognizer 112 may only be permitted to utilize up to a predetermined maximum percentage of processor time on the client device 106. The delegation engine 110 may be configured to transmit appropriate speech to the client-side recognizer 112 and server-side recognizer 120 in accordance with the responsibilities of each.
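
As a hedged illustration of one such division of labor, the hypothetical rule below routes command-and-control speech to both recognizers but dictation to the server-side recognizer only; the utterance-type labels are assumptions, not terms from the patent.

```python
# Hypothetical routing rule: the client-side engine 112 handles only
# command-and-control speech, while the server-side engine 120 handles
# every utterance type.
def route(utterance_type: str) -> dict[str, bool]:
    targets = {"client": False, "server": True}
    if utterance_type == "command":
        targets["client"] = True
    return targets


assert route("command") == {"client": True, "server": True}
assert route("dictation") == {"client": False, "server": True}
```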

The client-side recognizer 112 produces speech recognition results 114, such as text based on the speech 104 (step 210). Similarly, the server-side recognizer 120 produces speech recognition results 122, such as text based on the speech 104 (step 212). The results 114 and 122 may include other information, such as the set of best candidate words, confidence measurements associated with those words, and other output typically provided by speech recognition engines.
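
One plausible shape for such a result object is sketched below; the field names are assumptions chosen for the later sketches, not a structure defined by the patent.

```python
# Assumed result structure: best text plus the extra information the
# text mentions (candidate words, confidences, and timing).
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    text: str                              # best hypothesis
    confidence: float                      # overall confidence in [0, 1]
    candidates: list[str] = field(default_factory=list)  # n-best words
    start: float = 0.0                     # start of recognized span (s)
    end: float = 0.0                       # end of recognized span (s)
```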

The client-side results 114 and server-side results 122 may differ from each other. The client-side recognizer 112 and server-side recognizer 120 both provide their results 114 and 122, respectively, to an arbitration engine 124 in the client device 106. The arbitration engine 124 analyzes one or both of the results 114 and 122 to decide which of the two results 114 and 122 to provide (as results 126) to the delegation engine 110 (step 214). As will be described in more detail below, the arbitration engine 124 may perform step 214 either after receiving both of the results 114 and 122, or after receiving one of the results 114 and 122 but not the other. Therefore, in general the arbitration engine 124 produces the output 126 based on the client-side results 114 and/or the server-side results 122.

The delegation engine 110 provides the selected results 126 back to the requesting application 108 (step 216). As a result, the requesting application 108 receives speech recognition results 126 back from the delegation engine 110 as if the delegation engine 110 were a single, integrated speech recognition engine. In other words, the details of the operations performed by the delegation engine 110 and arbitration engine 124 are hidden from the requesting application 108.

The arbitration engine 124 may use any of a variety of techniques to select which of the client-side results 114 and server-side results 122 to provide to the delegation engine 110. For example, as illustrated by the method 300 of FIG. 3A, the arbitration engine 124 may select the client-side results 114 as soon as those results 114 become available (step 302), if the server-side recognizer 120 is not accessible over the network (e.g., if the connection between the client 106 and the network 116 is down) (steps 304-306).

Conversely, as illustrated by the method 310 of FIG. 3B, the arbitration engine 124 may select the server-side results 122 as soon as those results 122 become available (step 312), if the client-side recognizer 112 is not accessible (steps 314-316). This may occur, for example, if the client-side recognizer 112 has been disabled as a result of a high-priority CPU task being executed on the client device 106.
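
A sketch of these two availability rules (FIGS. 3A and 3B) might look as follows; the accessibility flags are assumed to come from hypothetical connectivity and CPU-load probes not described here.

```python
# Sketch of FIGS. 3A and 3B: return whichever results arrive first if
# the other recognizer is known to be unreachable or disabled.
def arbitrate_on_availability(results, source: str,
                              server_accessible: bool,
                              client_accessible: bool):
    if source == "client" and not server_accessible:
        return results   # FIG. 3A: server unreachable over the network
    if source == "server" and not client_accessible:
        return results   # FIG. 3B: client recognizer disabled
    return None          # otherwise defer to another policy (FIG. 3C)
```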

As another example, and assuming that the server-side recognizer 120 provides, on average, higher-quality recognition results than the client-side recognizer 112, the arbitration engine 124 may select the server-side recognizer's results 122 if those results 122 become available no later than a predetermined waiting time after the client-side recognizer's results 114 became available. In other words, as illustrated by the method 320 of FIG. 3C, once the client-side recognizer's results 114 become available (step 322), the arbitration engine 124 may return the server-side results 122 (step 330) only if they are received (step 324) before the predetermined waiting time has passed (step 326). If the server-side results 122 are not available by that time, then the arbitration engine 124 may return the client-side results 114 (step 328).
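
A sketch of the method 320 follows, assuming the futures returned by the earlier delegation-engine sketch. The wait_s value could be chosen per grammar type, as the waiting times suggested in the next paragraph indicate.

```python
# Sketch of FIG. 3C: after the client results arrive (step 322), wait a
# bounded time for the server results (steps 324-326); return them if
# they arrive in time (step 330), else the client results (step 328).
from concurrent.futures import TimeoutError as FutureTimeout


def prefer_server_within_wait(client_future, server_future,
                              wait_s: float = 0.5):
    client_results = client_future.result()          # step 322
    try:
        return server_future.result(timeout=wait_s)  # steps 324, 326, 330
    except FutureTimeout:
        return client_results                        # step 328
```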

The predetermined waiting time may be selected in any way. For example, the predetermined waiting time may depend on the type of recognition result. For example, the predetermined waiting time applied by the method 320 to command-and-control grammars may be selected to be shorter than the predetermined waiting time applied to dictation grammars. As just one example, a predetermined waiting time of 500 ms may be applied to command-and-control grammars, while a predetermined waiting time of 1000 ms may be applied to dictation grammars.

As yet another example, and as illustrated by the method 340 of FIG. 3D, even assuming that the server-side recognizer 120 provides, on average, higher-quality recognition results than the client-side recognizer 112, the arbitration engine 124 may select the client-side recognizer's results 114 (step 346) as soon as those results 114 become available (step 342), if the confidence measure associated with those results 114 exceeds some predetermined threshold value (step 344).

The arbitration engine 124 is not limited to “selecting” one or the other of the results 114 and 122 produced by the client-side recognizer 112 and server-side recognizer 120, respectively. Rather, for example, as illustrated by the method 350 of FIG. 3E, the arbitration engine 124 may receive the results 114 and 122 (steps 352 and 354), and combine or otherwise process those results 114 and 122 in various ways (step 356) to produce the output 126 provided back to the requesting application 108 (step 358). For example, the arbitration engine 124 may combine the results 114 and 122 using a well-known technology named ROVER (Recognizer Output Voting Error Reduction), or using other techniques, to produce the output 126.
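
Real ROVER first aligns the hypotheses with dynamic programming and then votes among aligned words. The sketch below skips the alignment step entirely and assumes the two hypotheses are already aligned word-for-word, so it is only a toy stand-in for the actual technique.

```python
# Toy confidence-based vote over two pre-aligned hypotheses; real ROVER
# aligns the word sequences first, which this sketch omits.
def combine_aligned(words_a, conf_a, words_b, conf_b):
    return [wa if ca >= cb else wb
            for (wa, ca), (wb, cb)
            in zip(zip(words_a, conf_a), zip(words_b, conf_b))]


print(combine_aligned(["close", "widow"], [0.9, 0.4],
                      ["clothe", "window"], [0.5, 0.8]))
# -> ['close', 'window']
```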

The arbitration engine 124 may combine the techniques disclosed above with respect to FIGS. 3A-3E, and with other techniques, in any combination. For example, the method 340 of FIG. 3D may be combined with the method 320 of FIG. 3C by performing steps 344 and 346 of method 340 after step 322 in FIG. 3C, and proceeding to step 324 of FIG. 3C if the confidence measure in step 344 does not exceed the threshold.
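
A sketch of this combined policy, reusing the assumptions of the earlier sketches (futures from the delegation engine, a .confidence field on results); the 0.85 threshold is an illustrative value, not one from the patent.

```python
# Sketch of FIG. 3D layered onto FIG. 3C: short-circuit on a confident
# client result (steps 344, 346); otherwise wait briefly for the server
# (steps 324-330) and fall back to the client result (step 328).
from concurrent.futures import TimeoutError as FutureTimeout


def arbitrate(client_future, server_future,
              threshold: float = 0.85, wait_s: float = 0.5):
    client_results = client_future.result()          # step 322
    if client_results.confidence > threshold:        # step 344
        return client_results                        # step 346
    try:
        return server_future.result(timeout=wait_s)  # steps 324-330
    except FutureTimeout:
        return client_results                        # step 328
```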

It is possible for results from one of the recognizers 112 and 120 to overlap in time with the results from the other recognizer, as illustrated by the method 400 of FIG. 4A. For example, assume that the speech 104 is five seconds in duration, and that the client-side recognizer 112 produces high-confidence results 114 for the first two seconds of the speech 104 (step 402). As a result of the high confidence measure of the results 114, the arbitration engine 124 may submit those results 114 to the delegation engine 110, which commits those results 114 (i.e., includes the results 114 in the results 126 that are passed back to the application 108) before the server-side results 122 become available (step 404). Then, when the server-side results 122 for some or all of the same five seconds of speech 104 become available, some or all of those results 122 may conflict (overlap in time) with some or all of the client-side results 114 (step 406). The arbitration engine 124 may take action in response to such overlap (step 408).

For example, as shown by the method 410 of FIG. 4B, if the client-side results 114 and the server-side results 122 overlap by less than some predetermined threshold time period (e.g., 100 ms) (step 412), then the arbitration engine 124 may consider the results 114 and 122 to be non-overlapping and process them in any of the ways described above with respect to FIGS. 3A-3E (step 414). Otherwise, the arbitration engine 124 may consider the results 114 and 122 to be overlapping and process them accordingly, such as in the ways described in the following examples (step 416).
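
The overlap test of step 412 might be sketched as follows, assuming results carry the start/end times of the earlier RecognitionResult sketch; the 100 ms threshold is the example value from the text.

```python
# Sketch of FIG. 4B, step 412: spans overlapping by less than the
# threshold are treated as non-overlapping (step 414).
def overlap_seconds(a_start, a_end, b_start, b_end) -> float:
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def are_overlapping(a, b, threshold_s: float = 0.1) -> bool:
    return overlap_seconds(a.start, a.end, b.start, b.end) >= threshold_s
```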

For example, as illustrated by the method 420 of FIG. 4C, the arbitration engine 124 may consider one of the recognizers (e.g., the server-side recognizer 120) to be preferred over the other recognizer. In this case, if results (e.g., client-side results 114) from the non-preferred recognizer arrive first (step 422) and are committed first (step 424), and then results (e.g., server-side results 122) from the preferred recognizer arrive (step 428) which overlap with the previously-committed non-preferred results, the arbitration engine 124 may commit (i.e., include in the hybrid results 126) the preferred results (e.g., server-side results 122) as well (step 430). Although this results in certain portions of the speech 104 being committed twice, this may produce more desirable results than discarding the results of a preferred recognizer. If the later-received results are not from the preferred recognizer, those results may be discarded rather than committed (step 432).

As yet another example, as illustrated by the method 440 of FIG. 4D, if results (e.g., server-side results 122) from the preferred recognizer arrive first (step 442) and are committed first (step 444), and then results (e.g., client-side results 114) from the non-preferred recognizer arrive which overlap with the previously-committed preferred results (steps 446 and 448), then the arbitration engine 124 may discard the non-preferred results (step 450). Otherwise, the arbitration engine 124 may commit the later-received results or process them in another manner (step 452).
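
The two preferred-recognizer rules (FIGS. 4C and 4D) reduce to a single hypothetical helper, sketched here under the assumption that the server-side engine is the preferred one.

```python
# Sketch of FIGS. 4C and 4D combined: commit late-arriving preferred
# results even though part of the speech is then committed twice
# (step 430); discard late-arriving non-preferred results (step 450).
def handle_late_overlapping(late_results, late_is_preferred: bool,
                            committed: list) -> list:
    if late_is_preferred:
        committed.append(late_results)   # FIG. 4C, step 430
    return committed                     # FIG. 4D, step 450: dropped
```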

More generally, as illustrated by FIG. 4E (which represents one embodiment of step 408 of FIG. 4A), if the arbitration engine 124 receives recognition results which overlap with any previously-committed result received from (the same or a different) speech recognizer, then the arbitration engine 124 may ignore the words from the new recognition results that overlap in time with the words from the old recognition results (using timestamps associated with each word in both recognition results) (step 462), and then commit the remaining (non-overlapping) words from the new recognition results (step 464).
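
A sketch of this word-level filtering, assuming each word is a hypothetical (text, start, end) tuple and committed spans are (start, end) pairs:

```python
# Sketch of FIG. 4E: drop new words whose timestamps overlap any
# previously committed span (step 462), then commit the rest (step 464).
def commit_nonoverlapping_words(new_words, committed_spans):
    kept = []
    for text, start, end in new_words:
        clashes = any(start < c_end and end > c_start
                      for c_start, c_end in committed_spans)
        if not clashes:
            kept.append((text, start, end))   # survives step 462
    return kept                               # committed in step 464
```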

As yet another example, as illustrated by FIG. 4F (which represents one embodiment of step 408 of FIG. 4A), if the arbitration engine 124 receives recognition results which overlap with any previously-committed result received from (the same or a different) speech recognizer, then the arbitration engine 124 may use the newly-received results to update the previously-committed results (step 472). For example, the arbitration engine 124 may determine whether the confidence measure associated with the newly-received results exceeds the confidence measure associated with the previously-committed results (step 474) and, if so, replace the previously-committed results with the newly-received results (step 476).
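
A sketch of the FIG. 4F update rule, again assuming a .confidence field on the result objects:

```python
# Sketch of FIG. 4F: replace a previously committed overlapping result
# only when the new result is more confident (steps 474, 476).
def maybe_replace(committed_result, new_result):
    if new_result.confidence > committed_result.confidence:  # step 474
        return new_result                                    # step 476
    return committed_result
```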

Embodiments of the present invention have a variety of advantages. In general, embodiments of the invention enable a client-side device, such as a cellular telephone, having limited resources to obtain high-quality speech recognition results within predetermined turnaround time requirements without requiring a high-availability, high-bandwidth network connection. The techniques disclosed herein effectively produce a hybrid speech recognition engine which uses both the client-side recognizer 112 and server-side recognizer 120 to produce better results than either of those recognizers could have produced individually. More specifically, the hybrid result can have better operating characteristics with respect to system availability, recognition quality, and response time than could be obtained from either of the component recognizers 112 and 120 individually.

For example, the techniques disclosed herein may be used to satisfy the user's turnaround time requirements even as the availability of the network 116 fluctuates over time, and even as the processing load on the CPU of the client device 106 fluctuates over time. Such flexibility results from the ability of the arbitration engine 124 to respond to changes in the turnaround times of the client-side recognizer 112 and server-side recognizer 120, and to other time-varying factors. Embodiments of the present invention thereby provide a distinct benefit over conventional server-side speech recognition techniques, which break down if the network slows down or becomes unavailable.

Hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide higher speech recognition accuracy than is provided by the faster of the two component recognizers (e.g., the server-side recognizer 120 in FIG. 1). This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the accuracy of the server-side recognizer, since that is the only recognizer used by the system.

Similarly, hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide a faster average response time than is provided by the slower of the two component recognizers (e.g., the client-side recognizer 112 in FIG. 1). This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the response time of the server-side recognizer, since that is the only recognizer used by the system.

Furthermore, embodiments of the present invention impose no constraints on the type or combinations of recognizers that may be used to form the hybrid system. Each of the client-side recognizer 112 and server-side recognizer 120 may be any kind of recognizer. Each of them may be chosen without knowledge of the characteristics of the other. Multiple client-side recognizers, possibly of different types, may be used in conjunction with a single server-side recognizer to effectively form multiple hybrid recognition systems. Either of the client-side recognizer 112 or server-side recognizer 120 may be modified or replaced without causing the hybrid system to break down. As a result, the techniques disclosed herein provide a wide degree of flexibility that makes them suitable for use in conjunction with a wide variety of client-side and server-side recognizers.

Moreover, the techniques disclosed herein may be implemented without requiring any modification to existing applications which rely on speech recognition engines. As described above, for example, the delegation engine 110 may provide the same interface to the application 108 as a conventional speech recognition engine. As a result, the application 108 may provide input to and receive output from the delegation engine 110 as if the delegation engine 110 were a conventional speech recognition engine. The delegation engine 110, therefore, may be inserted into the client device 106 in place of a conventional speech recognition engine without requiring any modifications to the application 108.

It is to be understood that although the invention has been described above in terms of particular embodiments, the foregoing embodiments are provided as illustrative only, and do not limit or define the scope of the invention. Various other embodiments, including but not limited to the following, are also within the scope of the claims. For example, elements and components described herein may be further divided into additional components or joined together to form fewer components for performing the same functions.

The techniques described above may be implemented, for example, in hardware, software tangibly stored on a computer-readable medium, firmware, or any combination thereof. The techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input entered using the input device to perform the functions described and to generate output. The output may be provided to one or more output devices.

Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language. The programming language may, for example, be a compiled or interpreted programming language.

Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives instructions and data from a read-only memory and/or a random access memory. Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk. These elements will also be found in a conventional desktop or workstation computer as well as other computers suitable for executing computer programs implementing the methods described herein, which may be used in conjunction with any digital print engine or marking engine, display monitor, or other raster output device capable of producing color or gray scale pixels on paper, film, display screen, or other output medium.

CLAIMS

1. A computer-implemented method performed by a client device, the method comprising:
(A) receiving a request from a requester to apply automatic speech recognition to an audio signal;
(B) providing the audio signal to a first automatic speech recognition engine in the client device;
(C) receiving first speech recognition results from the first automatic speech recognition engine;
(D) determining whether a second automatic speech recognition engine, in a server device, is accessible to the client device;
(E) if the second automatic speech recognition engine is determined not to be accessible to the client device, then providing the first speech recognition results to the requester in response to the request.